- Goals of model selection and regularization
- Subset selection
- Ridge regression
- Lasso regression
Note: although we talk about regression here, everything applies to logistic regression as well (and hence classification).
When building a regression model remember that simplicity is your friend. Smaller models are easier to interpret and have fewer unknown parameters to be estimated.
Keep in mind that every additional parameter represents a cost!
The first step of every model-building exercise is to define the universe of variables that could potentially be used. This task relies entirely on your experience and context-specific knowledge.
With a set of variables in hand, the goal now is to select the best model. Why not include all the variables?
Big models tend to over-fit and find features that are specific to the data in hand, i.e. not generalizable relationships.
The results are bad predictions and bad science!
In addition, bigger models have more parameters and potentially more uncertainty about everything we are trying to learn.
We need a strategy to build a model in ways that account for the trade-off between bias and variance: subset selection, shrinkage, dimension reduction.
Recall the linear model
\[Y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon\] Here we will focus on improving the linear model by using model-fitting strategies other than ordinary least squares (OLS).
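As a point of reference before moving beyond it, here is a minimal sketch (in Python, on simulated data with illustrative names, not from the notes) of fitting this linear model by OLS:

```python
import numpy as np

# Minimal OLS baseline on simulated data (illustrative only).
rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 2.0, 0.0, -1.0])         # intercept plus p slopes
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.5, size=n)

X1 = np.column_stack([np.ones(n), X])                # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)    # ordinary least squares
print(beta_hat)                                      # estimates of beta_0, ..., beta_p
```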
Why move beyond OLS?
There are 3 options: best subset selection, forward stepwise selection, and backward stepwise selection.
The last two options are known as “stepwise regression”: they build a regression model step-by-step and they are computationally efficient.
The idea behind best subset selection is very simple: fit all possible models and compare their performance based on some criterion (e.g., measure the generalization error of each one on a test set).
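A minimal sketch of that idea, assuming a held-out test set and plain least squares fits (the function name and interface are illustrative):

```python
import itertools
import numpy as np

def best_subset(X_train, y_train, X_test, y_test):
    """Fit OLS on every subset of predictors and return the subset with the
    smallest mean squared error on the held-out test set (exhaustive search)."""
    p = X_train.shape[1]
    Xtr_full = np.column_stack([np.ones(len(y_train)), X_train])
    Xte_full = np.column_stack([np.ones(len(y_test)), X_test])
    best_mse, best_cols = np.inf, ()
    for k in range(p + 1):
        for subset in itertools.combinations(range(p), k):
            cols = [0] + [j + 1 for j in subset]      # always keep the intercept
            beta, *_ = np.linalg.lstsq(Xtr_full[:, cols], y_train, rcond=None)
            mse = np.mean((y_test - Xte_full[:, cols] @ beta) ** 2)
            if mse < best_mse:
                best_mse, best_cols = mse, subset
    return best_cols, best_mse
```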
Issues? With \(p\) candidate variables there are \(2^p\) possible models to fit, so the exhaustive search quickly becomes computationally infeasible; and searching over so many models makes it easy to find one that fits the particular data at hand by chance.
Another way to evaluate a model is to use information criteria: metrics that attempt to quantify how well our model would have predicted the data (regardless of what you have estimated for the \(\beta_j\)’s).
We can indirectly estimate test error by making an adjustment to the training error to account for the bias due to overfitting.
Bottom line: these are only useful if we lack the ability to compare models based on their out-of-sample predictive ability!
AIC: criterion defined for a large class of models fit by maximum likelihood \[AIC = \underbrace{- 2 \log L}_\text{Deviance} + 2d\] where \(L\) is the maximized value of the likelihood function for the estimated model.
In the case of a linear model with Gaussian errors, maximum likelihood and least squares are the same thing; in that setting \(C_p = \frac{1}{n}(RSS + 2 d \hat{\sigma}^2)\) and AIC are equivalent
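As a sketch of what the formula computes, the following helper (illustrative; Gaussian errors assumed, with the error variance taken as its in-model MLE \(RSS/n\)) evaluates AIC for a least squares fit:

```python
import numpy as np

def gaussian_aic(y, X1, d):
    """AIC = -2 log L + 2 d for a Gaussian linear model fit by least squares.
    X1 is the design matrix including the intercept column; d is the number
    of fitted coefficients (whether the variance parameter is counted in d is
    a convention choice and only shifts every model by the same constant)."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    rss = np.sum((y - X1 @ beta) ** 2)
    sigma2_mle = rss / n                                        # MLE of the error variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2_mle) + 1)    # maximized log-likelihood
    return -2 * loglik + 2 * d
```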
BIC (Bayes Information Criterion): based on a “Bayesian” philosophy of statistics \[BIC = \frac{1}{n}(RSS + d \hat{\sigma}^2 \log n)\]
BIC replaces the \(2 d \hat{\sigma}^2\) used in \(C_p\) with a \(d \hat{\sigma}^2 \log n\) term, where \(n\) is the number of observations
Since \(\log n > 2\) for \(n > 7\), the BIC statistic generally places a heavier penalty on models with many variables, and hence results in the selection of smaller models than \(C_p\)
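The two penalties can be compared directly with a small helper on the scale used above (names are illustrative; `sigma2_hat` is an estimate of the error variance, e.g. from the full model):

```python
import numpy as np

def cp_ls(rss, n, d, sigma2_hat):
    """Mallows' C_p on the scale used above: (1/n) * (RSS + 2 * d * sigma2_hat)."""
    return (rss + 2 * d * sigma2_hat) / n

def bic_ls(rss, n, d, sigma2_hat):
    """BIC on the same scale: (1/n) * (RSS + d * sigma2_hat * log(n)).
    For n > 7, log(n) > 2, so BIC penalizes model size more heavily than C_p."""
    return (rss + d * sigma2_hat * np.log(n)) / n
```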
Remember \(R^2\), measure of goodness of fit? \[R^2 = 1 - \dfrac{RSS}{TSS}\]
Problem: since \(RSS\) is based on the training set, it can only decrease the more variables we add! Therefore \(R^2\) can only increase for more complex models!
For a least squares model with \(d\) variables, the adjusted \(R^2\) statistic is calculated as: \[R^2_{adj} = 1 - \dfrac{RSS/(n-d-1)}{TSS/(n-1)}\]
Unlike \(C_p\), AIC and BIC, for which a small value indicates a model with a low test error, here a large value of \(R^2_{adj}\) indicates a model with a small test error.
Maximizing the \(R^2_{adj}\) is equivalent to minimizing \(RSS/(n-d-1)\). While \(RSS\) always decreases as the number of variables in the model increases, the term \(d\) in the denominator corrects for this and penalizes more complicated models.
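A small helper illustrating the two quantities side by side (a sketch; `X1` is assumed to already contain the intercept column and `d` counts the predictors):

```python
import numpy as np

def r2_and_adjusted(y, X1, d):
    """R^2 and adjusted R^2 for a least squares fit with d predictors,
    where X1 is the design matrix including the intercept column."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    rss = np.sum((y - X1 @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1 - rss / tss
    r2_adj = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return r2, r2_adj
```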